---
title: "Netflix Dataset Analysis"
output:
flexdashboard::flex_dashboard:
orientation: rows
vertical_layout: fill
storyboard: true
social: menu
source_code: embed
---
```{r setup, include=FALSE}
library(stringr)
library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(forcats)
library(ggthemes)
library(tidyverse)
library(wordcloud)
library(tokenizers)
library(countrycode)
library(tm)
library(proxy)
library(igraph)
library(plotly)
# Read the data
df <- read.csv('/Users/huxin/Desktop/netflix_titles.csv')
# Data Preprocess
df['year'] <- str_sub(df$date_added,-4,-1) # extract the year from the date_added column
df['season_count'] <- ifelse(grepl("Season", df$duration), sapply(strsplit(as.character(df$duration), " "), "[", 1), NA) # number of seasons for TV shows (first token handles 10+ seasons too)
df['duration'] <- ifelse(grepl("Season", df$duration), NA, sapply(strsplit(as.character(df$duration), " "), "[", 1)) # runtime in minutes for movies
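# Illustrative sanity check (assumption: raw durations only ever have the
# form "N min" or "N Season(s)"): taking the first whitespace-separated
# token recovers the numeric part in both cases.
stopifnot(identical(sapply(strsplit(c("90 min", "2 Seasons"), " "), "[", 1), c("90", "2")))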
```
EDA
=======================================================================
Inputs {.sidebar}
-----------------------------------------------------------------------
Chart 1 (Top Left): From this figure, we can see that among the most frequent actors, International Movies is the most common genre; Dramas and Comedies are also very popular.
Chart 2 (Top Right): When exploring differences in duration for the five countries with the most movie and TV show content, the distributions for Canada, Japan and the United Kingdom are very similar, while India generally has longer durations than the rest. The US also has some outliers with longer durations.
Chart 3 (Bottom Left): Through this simple pie chart, it is easy to see the contribution of Netflix content by type: movies account for a larger proportion than TV shows.
Chart 4 (Bottom Right): In the distribution of ratings, the largest count of titles carries the 'TV-MA' rating, which the TV Parental Guidelines assign to programs designed for mature audiences only. The second largest is 'TV-14', which stands for content that may be inappropriate for children younger than 14. The third largest is the familiar 'R' rating, which the Motion Picture Association of America assigns to films with material that may be unsuitable for children under 17.
Row{data-height=200}
-------------------------------------
### Genre distribution for 6 most frequent actors
```{r,fig.width=7}
# What kind of genres do top actors (by frequency) belong to?
df_cast <- df %>%
mutate(cast = strsplit(as.character(cast), ",")) %>%
unnest(cast) %>%
mutate(cast = trimws(cast, which = c("left"))) %>% #eliminate space on the left side
group_by(cast) %>%
add_tally() %>%
select(cast,n,listed_in) %>%
unique()
df_actor_top <- df_cast[order(-df_cast$n),]
#count the genres
df_actor_top_genre <- df_actor_top %>%
select(cast, listed_in) %>%
mutate(listed_in = strsplit(as.character(listed_in), ",")) %>%
unnest(listed_in) %>%
mutate(listed_in = trimws(listed_in, which = c("left"))) %>% #eliminate space on the left side
group_by(cast,listed_in) %>%
add_tally() %>%
unique()
df_actor_top_only <- df_actor_top[,1:2] %>%
unique()
df_actor_top_only <- df_actor_top_only[1:30,]
df_actor_top_only <- df_actor_top_only[order(-df_actor_top_only$n),][1:6,]
df_actor_top_5_genre <- df_actor_top_genre[df_actor_top_genre$cast %in% df_actor_top_only$cast,]
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(23)
ggplot(data = df_actor_top_5_genre, aes(x = "", y = n, fill = listed_in )) +
facet_wrap(~ cast) +
geom_bar(stat = "identity",position = position_fill()) +
coord_polar(theta = "y") +
theme_tufte() +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank())+
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())+
theme(legend.title = element_blank())+
theme(plot.title = element_text(size = 15)) +
scale_fill_manual(values = mycolors)
```
### Distribution of Duration for Countries with the Most Contents
```{r,fig.width=7}
# Duration distribution for five top countries
df <- df[(!df$country == ""), ]
top_country <- df %>%
group_by(country) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(5)
df_top <- df[df$country %in% top_country$country,]
df_top <- df_top[!is.na(df_top$duration), ]
df_top$duration <- as.integer(df_top$duration)
ggplot(df_top, aes(x = country, y = duration, fill = country)) +
geom_violin() +
geom_boxplot(width = 0.1) +
labs(x = "", y = "Duration(minutes)") +
theme_tufte() +
theme(plot.title = element_text(size = 16),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 13))
```
Row{data-height=200}
-------------------------------------
### Amount Of Netflix Content By Type
```{r,fig.width=7}
amount_by_type <- df %>% group_by(type) %>% summarise(
count = n()
)
fig1 <- plot_ly(amount_by_type, labels = ~type, values = ~count, type = 'pie', marker = list(colors = c("#bd3939", "#399ba3")))
fig1 <- fig1 %>% layout(
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
fig1
```
### Movie Rating Analysis
```{r,fig.width=7}
# Bar plot for distribution of ratings
df <- df[(!df$rating == ""), ]
colourCount = length(unique(df$rating))
getPalette = colorRampPalette(brewer.pal(8, "Set1")) # brewer.pal requires n >= 3
ggplot(df, aes(x = fct_infreq(factor(rating)),fill = factor(rating))) +
geom_bar(stat = "count", width = .5) +
scale_fill_manual(values = getPalette(colourCount)) +
labs(x = 'rating', subtitle = "Count vs. Rating") +
theme_tufte() +
theme(axis.text.x = element_text(angle = 65, vjust = 0.6)) +
theme(legend.position = "none")
```
Text Visualization{.storyboard data-navmenu="More Visualizations"}
=======================================================================
### Most frequent word used in movie titles
```{r}
# Most frequent word used in movie titles
tot_title <- paste(df[,3],collapse = " ") #All titles into one
tot_title_words <- tokenize_words(tot_title) #Tokenize sentence to words
words.freq <- table(unlist(tot_title_words))
result <- cbind.data.frame(words = names(words.freq) ,amount = as.integer(words.freq))
result_dec <- result[order(-result$amount),]
result_dec_filter <- result_dec %>%
filter(nchar(as.character(words)) > 3) # drop short, uninformative words
wordcloud(words = result_dec_filter$words, freq = result_dec_filter$amount, min.freq = 1, max.words = 150, random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8,"Set2"))
```
***
Based on the wordcloud on the left, "love", "story", "christmas" and "world" are the most frequent words in movie titles. This suggests that Christmas movies make up a sizeable share of the catalogue, and that romantic movies are a major segment of the market. Beyond these two types, words like "rangers", "super", "world" and "power" hint that hero movies, such as those related to the Marvel Cinematic Universe, are also well represented.
### Through this bar plot, we see the distribution of the length of movie/show titles. The average length is 3 words, and the longest title has 17 words
```{r}
# How long are the titles movie/shows? And which one is the longest?
title_length <- sapply(strsplit(as.character(df[,3]), " "), length)
title_length <- cbind.data.frame(num = title_length, titles = df[,3] )
mean_title <- mean(title_length$num)
max_title <- max(title_length$num)
Longest_title <- title_length %>%
filter(num == max_title)
a <- ggplot(title_length, aes(num)) +
geom_bar(fill = "rosybrown3") +
theme_bw() +
xlab("Title length ( # words)") + ylab("Frequency") +
theme(legend.position = "none") +
labs(title = "Length distribution of shows titles")+
geom_segment(aes(x = 3, y = 0, xend = 3, yend = 1500), linetype = "dashed") +
ggplot2::annotate("text", x = 5, y = 1500, label = "Average",color = "black", size = 4)
a
```
Time Series{.storyboard data-navmenu="More Visualizations"}
=======================================================================
### By Type (Movies/TV)
```{r}
# The number of Movie and TV shows over years
df['date_added'] <- as.Date(df$date_added,format = "%B %d, %Y")
Time_Period <- df %>%
group_by(type,date_added) %>%
summarise(count = n()) %>%
mutate(total_shows = cumsum(count)) #Get the cumulative sum of movie and TV shows over years
fig4 <- plot_ly(Time_Period, x = ~date_added, y = ~total_shows, color = ~type, type = 'scatter', mode = 'lines', colors = c("#bd3939", "#9addbd", "#399ba3"))
fig4 <- fig4 %>% layout(yaxis = list(title = 'Count'), xaxis = list(title = 'Date'), title = "Amount of Content as a Function of Time")
fig4
```
***
The growth in the number of movies on Netflix is much higher than that of TV shows; about 1,300 new movies were added in each of 2018 and 2019. Growth in content started in 2013, and Netflix kept adding movies and TV shows to its platform over the years. This content was varied: titles from different countries, released across many years.
### By Movie/TV Category{data-commentary-width=200}
```{r}
df_category <- df %>%
mutate(listed_in = strsplit(as.character(listed_in), ",")) %>%
unnest(listed_in) %>%
mutate(listed_in = trimws(listed_in, which = c("left"))) %>% #eliminate space on the left side
group_by(listed_in,date_added) %>%
summarise(count = n()) %>%
mutate(total_shows = cumsum(count))
fig5 <- plot_ly(df_category, x = ~date_added, y = ~total_shows, color = ~listed_in, type = 'scatter', mode = 'lines')
fig5 <- fig5 %>% layout(yaxis = list(title = 'Count'), xaxis = list(title = 'Date'), title = "Amount of Content as a Function of Time (By Category)")
fig5
```
***
The growth rates for International Movies, Dramas and Comedies are the top three. Beyond these, International TV Shows, Independent Movies and Action & Adventure are also growing very quickly compared to the rest of the categories.
Geographical{data-navmenu="More Visualizations"}
=======================================================================
Inputs {.sidebar}
-----------------------------------------------------------------------
The map shows a comparison of content by country. The United States, Canada, India, the United Kingdom and Japan have the most content. Beyond the top five, China, France, Australia and Spain also have competitive amounts of movies/TV shows.
Column
-----------------------------------------------------------------------
### Map
```{r}
# Map
df_country <- df %>%
mutate(country = strsplit(as.character(country), ",")) %>%
unnest(country) %>%
mutate(country = trimws(country, which = c("left"))) %>% #eliminate space on the left side
group_by(country) %>%
add_tally() %>%
select(country,n) %>%
unique()
df_country['Code'] <- countrycode(df_country$country,"country.name", "iso3c")
df_country <- na.omit(df_country)
map <- plot_ly(df_country , type = 'choropleth', locations = df_country$Code, z = df_country$n, text = df_country$country, colorscale = 'Inferno')
map <- map %>% layout(
title = 'Comparison of Contents by Country
(Cumulative from 2008 to 2020)')
map <- map %>% colorbar(title = 'Number of Contents')
map
```
Network{.storyboard data-navmenu="More Visualizations"}
=======================================================================
Inputs {.sidebar}
-----------------------------------------------------------------------
To find movies similar to "The Irishman", we first built a document-term matrix from the movie descriptions. Then, by computing cosine similarity, we found the three movies with the highest similarity scores. The network shows the relationship between these four movies. Node attributes include cast, country, director, listed_in (category), movie name, rating, release year, type (movie/TV show) and cluster.
In this network, some cast members are shared among these movies, and two of them share the same director. A few of them also share the same type. There are indeed common characteristics between "The Irishman" and the three similar movies.
Network
-----------------------------------------------------------------------
```{r}
df_US <- df %>%
filter(country == 'United States', release_year > 2015, type == 'Movie')
# Make a corpus from the column containing the document text
source <- VectorSource(df_US$description)
corpus <- Corpus(source)
# Take the standard steps to clean and prepare the data
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
# Create a document-term matrix:
mat <- DocumentTermMatrix(corpus)
mat4 <- weightTfIdf(mat)
mat4 <- as.matrix(mat4)
# Normalize the TF-IDF vectors to unit Euclidean length
norm_eucl <- function(m)
m/apply(m,1,function(x) sum(x^2)^.5)
mat_norm <- norm_eucl(mat4)
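# Sanity check on the normalization (illustrative): every non-empty row of
# mat_norm should now have unit Euclidean length. All-zero rows would give
# NaN and are skipped here.
row_norms <- sqrt(rowSums(mat_norm^2))
stopifnot(all(abs(row_norms[is.finite(row_norms)] - 1) < 1e-8))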
set.seed(5)
k <- 10
kmeansResult <- kmeans(mat_norm, k)
# Cbind the column of cluster
df_cluster <- cbind(df_US, cluster = kmeansResult$cluster)
```
```{r}
# Cosine similarity between all pairs of rows (movie description vectors)
cos_sim = function(matrix){
numerator = matrix %*% t(matrix)
A = sqrt(apply(matrix^2, 1, sum)) # row-wise Euclidean norms
denominator = A %*% t(A)
return(numerator / denominator)
}
dfSim <- cos_sim(mat4)
# Exclude self-similarity by blanking the diagonal
diag(dfSim) <- NA
# Find the three most similar movies for a given title
find_similar = function(movie){
similar = df_cluster[df_cluster$title == movie,]
index = which(df_cluster$title == movie, arr.ind = TRUE)
top_3 = as.data.frame(sort(dfSim[index,], decreasing = TRUE)[1:3])
for (i in 1:3) {
index = as.integer(rownames(top_3)[i])
similar = rbind(similar,df_cluster[index,])
}
return(similar)
}
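# Toy sanity check of cos_sim (illustrative): identical rows should have
# similarity 1, orthogonal rows similarity 0.
toy <- rbind(c(1, 0), c(1, 0), c(0, 1))
stopifnot(isTRUE(all.equal(cos_sim(toy)[1, 2], 1)), abs(cos_sim(toy)[1, 3]) < 1e-8)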
```
```{r}
sim <- find_similar('The Irishman')
# Drop irrelevant columns
drops <- c("show_id","date_added","duration","description","season_count","year")
sim <- sim[ , !(names(sim) %in% drops)]
# Replace empty space with NA
sim <- sim %>% mutate_all(na_if,"")
sim[is.na(sim)] <- "Not shown"
# Add Cast
sim_cast <- sim %>% select(title,cast)
edge_list <- sim_cast %>%
mutate(cast = strsplit(cast, ",")) %>%
unnest(cast) %>%
mutate(cast = trimws(cast, which = c("left")))
edge_list <- as.data.frame(edge_list)
colnames(edge_list) <- c("source", "target")
edge_list['attribute'] <- "cast"
# Add Listed_in
sim_list <- sim %>% select(title,listed_in)
edge_list_2 <- sim_list %>%
mutate(listed_in = strsplit(listed_in, ",")) %>%
unnest(listed_in) %>%
mutate(listed_in = trimws(listed_in, which = c("left")))
colnames(edge_list_2) <- c("source", "target")
edge_list_2['attribute'] <- "listed"
edge_list <- rbind(edge_list, edge_list_2)
```
```{r,fig.width=9}
column_names = colnames(sim[,-c(2,4,8)]) # all attributes except title, cast and listed_in, handled above
# Create an empty dataframe to collect the remaining edges
result = data.frame(source = character(), target = character(), attribute = character())
for (i in column_names) {
# One edge per movie for this attribute; the attribute column records which field the edge came from
edge_list_new <- sim %>% select(title, all_of(i))
colnames(edge_list_new) <- c("source", "target")
edge_list_new$target <- trimws(as.character(edge_list_new$target))
edge_list_new['attribute'] <- i
result <- rbind(result, as.data.frame(edge_list_new))
}
edge_list <- rbind(edge_list, result)
write.csv(edge_list,'edge_list.csv')
# Create attribute list
attributes <- read.csv('/Users/huxin/Desktop/attributes.csv')
# Network
net <- graph_from_data_frame(edge_list)
net <- set_vertex_attr(net, "type", index = V(net), as.character(attributes$attribute))
colrs <- brewer.pal(9, "Paired")
my_color <- colrs[as.numeric(as.factor(V(net)$type))]
par(mar = c(0,0,1,0) + .1)
plot(net, vertex.size = 13, vertex.frame.color = "gray", vertex.label.color = "black",
vertex.label.cex = 0.4, vertex.label.dist = 0, edge.arrow.size = 0.2, vertex.color = my_color)
legend("bottomleft", legend = levels(as.factor(V(net)$type)), col = colrs , bty = "n", pch = 20 , pt.cex = 2.5, cex = 0.8, text.col = colrs , horiz = FALSE, inset = c(0.05, 0.05))
title("Similar movies for The Irishman",cex.main = 1.1)
```
About
=======================================================================
This dataset consists of TV shows and movies available on [Netflix as of 2019](https://www.kaggle.com/shivamb/netflix-shows). The dataset was collected from Flixable, a third-party Netflix search engine.
In 2018, Flixable released an interesting report showing that the number of TV shows on Netflix had nearly tripled since 2010, while the streaming service's number of movies had decreased by more than 2,000 titles over the same period. The objective of this visual analysis is to explore what other insights can be obtained from the dataset, including but not limited to questions like:
1. What's the time duration for movies/TV shows from different countries?
2. What kind of genres do top actors (by frequency) belong to?
3. What are the most frequent words used in movie titles?
4. What's the growth rate for movie and TV shows over years?
The application layout is produced with the [flexdashboard](https://rmarkdown.rstudio.com/flexdashboard/) package, and the charts, maps and network use [Plotly](https://plotly.com/), [ggplot2](https://plotly.com/ggplot2/), [igraph](https://igraph.org/r/), and [wordcloud](https://www.r-graph-gallery.com/wordcloud.html), all accessed through their corresponding R packages.